test: Add unit tests to test multiple files in single dataset #412

Abhishek-TAMU · 2024-12-10T21:36:48Z

Description of the change

1- Added a unit test case verifying handling of multiple data files in a single DataSetConfig passed via data_config.
2- Added a unit test case verifying handling of multiple data files of different formats in a single DataSetConfig passed via data_config.
3- Added a unit test case verifying handling of multiple data files of different types in a single DataSetConfig passed via data_config.
4- Added a unit test case (by @willmj) verifying handling of multiple data files of different formats across multiple DataSetConfig entries passed via data_config.
5- Organized the data files in the tests/artifacts/testdata directory by file format for better categorization.
6- Unit tests in test_sft_trainer.py for e2e testing:

   - Existing Full FT unit tests with Arrow and Parquet data file formats
   - New Full FT unit test with multiple datasets and multiple files
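
For context, the cases above all revolve around one dataset entry that lists several files, possibly in different formats. A minimal sketch of that shape as a plain Python dict follows. The key names (`datasets`, `name`, `data_paths`), the file names, and the `formats_per_dataset` helper are assumptions for illustration and are not copied from the repository's schema or tests.

```python
import os

# Hypothetical data_config contents: one dataset entry listing several files.
# Key names and paths are illustrative assumptions, not the project's schema.
data_config = {
    "datasets": [
        {
            "name": "twitter_complaints",
            "data_paths": [
                "testdata/jsonl/part_1.jsonl",
                "testdata/arrow/part_2.arrow",
                "testdata/parquet/part_3.parquet",
            ],
        }
    ]
}


def formats_per_dataset(config):
    """Map each dataset name to the sorted list of file extensions it lists."""
    result = {}
    for dataset in config["datasets"]:
        extensions = {
            os.path.splitext(path)[1].lstrip(".") for path in dataset["data_paths"]
        }
        result[dataset["name"]] = sorted(extensions)
    return result


print(formats_per_dataset(data_config))
```

A helper like this is only a quick way to see which formats a config mixes; the actual tests load the files through the project's preprocessing code.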

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487

How to verify the PR

Verify unit test additions.

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

github-actions · 2024-12-10T21:37:01Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

willmj

This looks good to me. I like splitting the test data into different folders by type. I have a couple of comments below, and I'm interested to hear Dushyant's thoughts before merging.

willmj · 2024-12-11T19:14:53Z

tests/test_sft_trainer.py

+            [
+                TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
+                TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
+            ],


Would there be a benefit in using multiple types of data here such as JSONL for one dataset, Arrow for another, and Parquet for a third?

It may not be strictly necessary, but it is good to have. Added!

Has this been added? I do not see the mix of different data types that @willmj asked for here.

Yes, this was added. I can see it in the commit changes here. Let me know if other changes are required in this unit test.

willmj · 2024-12-11T19:24:52Z

tests/data/test_data_preprocessing_utils.py

+        ),
+    ],
+)
+def test_process_dataconfig_multiple_files(data_config_path, list_data_path):


Might be worth adding a test with three files, just in case.

Since this test already covers multiple cases, I added a case with 3 files to the same unit test for all 3 handlers.

We can reduce the number of test cases and possibly just have:

Mix of all three -> 1 test

Each dataset with multiple files, either two or three, maybe in a random mix

I think 3-4 scenarios should be fine; the rest are similar anyway.

Ah, I see there is one with varied data formats below, so maybe just reducing the number of tests here could also work.

I have optimized the unit tests based on the comments below. If further optimization is needed, do let me know here. @dushyantbehl
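
The consolidation discussed in this thread amounts to a small table of scenarios rather than one test per combination. A rough sketch of that idea follows, assuming placeholder file names and a stand-in `describe` check in place of the real preprocessing call. In the test suite, a table like this would typically feed `pytest.mark.parametrize`.

```python
# Hypothetical scenario table: each entry is (scenario id, list of data files).
# File names are placeholders, not the repository's actual test artifacts.
SCENARIOS = [
    ("two_files_same_format", ["a.jsonl", "b.jsonl"]),
    ("three_files_same_format", ["a.parquet", "b.parquet", "c.parquet"]),
    ("three_files_mixed_formats", ["a.jsonl", "b.arrow", "c.parquet"]),
]


def describe(data_path_list):
    """Stand-in for the real check: return (file count, distinct format count)."""
    formats = {path.rsplit(".", 1)[-1] for path in data_path_list}
    return len(data_path_list), len(formats)


for scenario_id, data_path_list in SCENARIOS:
    n_files, n_formats = describe(data_path_list)
    print(f"{scenario_id}: {n_files} files, {n_formats} format(s)")
```

Three scenarios like these cover the two-file, three-file, and mixed-format cases without repeating near-identical tests.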

willmj · 2024-12-11T19:26:42Z

tuning/data/data_handlers.py

+    if dataset_text_field not in element:
+        raise KeyError(f"Dataset should contain {dataset_text_field} field.")


Thanks for this!

Thanks for the catch!
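
The guard in this hunk fails fast when a record lacks the configured text field. A minimal sketch of the pattern follows, assuming a hypothetical handler name and signature (`append_eos`) rather than the actual function in tuning/data/data_handlers.py.

```python
def append_eos(element, dataset_text_field, eos_token="</s>"):
    """Hypothetical handler: append an EOS token to the configured text field."""
    # Same guard as in the diff: reject records missing the expected field.
    if dataset_text_field not in element:
        raise KeyError(f"Dataset should contain {dataset_text_field} field.")
    return {dataset_text_field: element[dataset_text_field] + eos_token}


# A record with the expected field is processed normally.
print(append_eos({"output": "hello"}, "output"))

# A record without it raises a KeyError that names the missing field.
try:
    append_eos({"text": "hello"}, "output")
except KeyError as err:
    print("rejected:", err)
```

Raising early with the field name in the message makes a misconfigured `dataset_text_field` easier to diagnose than a bare lookup failure deeper in the handler.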

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Abhishek-TAMU · 2024-12-11T21:01:55Z

tests/data/test_data_preprocessing_utils.py

+        ),
+    ],
+)
+def test_process_dataconfig_multiple_datasets_datafiles(datafiles, datasetconfigname):


This test case looks good. Updated the description. Thanks @willmj

Could this test be combined with this one?

fms-hf-tuning/tests/data/test_data_preprocessing_utils.py

Line 782 in 4168c87

def test_process_dataset_configs_with_sampling(datafiles, datasetconfigname):

Makes sense. Combined into test_process_dataconfig_multiple_datasets_datafiles_sampling.

dushyantbehl · 2024-12-12T17:11:24Z

.pylintrc

@@ -333,7 +333,7 @@ indent-string='    '
 max-line-length=100

 # Maximum number of lines in a module.
-max-module-lines=1200
+max-module-lines=1400


Seems like we will keep hitting this. I also had to disable this specifically for test_sft_trainer.py

https://github.com/foundation-model-stack/fms-hf-tuning/pull/409/files#diff-406c3c4d02c1ff158009b04d26a6b2d357f05d11e8c35c3f2ddfc1b54649a022R18

Sounds good. Removed it after merging in the change from the mentioned PR.
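
For reference, a per-file opt-out of this check is a one-line pragma near the top of the module. `too-many-lines` is pylint's message name for exceeding `max-module-lines`; whether PR #409 uses exactly this form is an assumption, since that diff is not shown here.

```python
# Placed near the top of a long test module, for example tests/test_sft_trainer.py.
# pylint: disable=too-many-lines
```

Scoping the exception to one file keeps the global limit in .pylintrc unchanged for the rest of the code base.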

dushyantbehl · 2024-12-12T17:13:15Z

tests/artifacts/testdata/__init__.py

@@ -19,37 +19,47 @@

 ### Constants used for data
 DATA_DIR = os.path.join(os.path.dirname(__file__))
+JSON_DATA_DIR = os.path.join(os.path.dirname(__file__), "json")


Thanks for doing this segregation.
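
A sketch of how per-format directory constants of this kind can be laid out. Only `DATA_DIR` and `JSON_DATA_DIR` appear in the diff above; the other directory names, the fixed base path, and the example file constant are assumptions for illustration.

```python
import os

# Assumed base directory; the real module derives it from __file__.
DATA_DIR = os.path.join("tests", "artifacts", "testdata")

# JSON_DATA_DIR appears in the diff; the other three names are assumptions.
JSON_DATA_DIR = os.path.join(DATA_DIR, "json")
JSONL_DATA_DIR = os.path.join(DATA_DIR, "jsonl")
ARROW_DATA_DIR = os.path.join(DATA_DIR, "arrow")
PARQUET_DATA_DIR = os.path.join(DATA_DIR, "parquet")

# Hypothetical file constant built on top of a per-format directory.
EXAMPLE_JSONL_FILE = os.path.join(JSONL_DATA_DIR, "example.jsonl")
print(EXAMPLE_JSONL_FILE)
```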

dushyantbehl · 2024-12-12T17:17:06Z

tests/data/test_data_preprocessing_utils.py

@@ -491,6 +497,284 @@ def test_process_dataconfig_file(data_config_path, data_path):
        assert formatted_dataset_field in set(train_set.column_names)


+@pytest.mark.parametrize(
+    "data_config_path, list_data_path",


nit: could we rename the variable list_data_path to data_path_list?

dushyantbehl · 2024-12-12T18:37:31Z

tests/data/test_data_preprocessing_utils.py

+        ),
+    ],
+)
+def test_process_dataconfig_multiple_files_varied_types(


Can this be combined with the above test?

Yea, this is added! Thank you!

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Abhishek-TAMU · 2024-12-13T00:09:42Z

Thank you @dushyantbehl for the review. I have addressed the requested changes. Feel free to have another look.

dushyantbehl

LGTM @Abhishek-TAMU

willmj

LGTM

test: Add unit tests to test multiple files in single dataset

ce82af1

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

github-actions bot added the test label Dec 10, 2024

e2e testing unit test for multiple datasets with multiple files

3fe7425

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Abhishek-TAMU marked this pull request as ready for review December 11, 2024 18:21

Abhishek-TAMU requested review from anhuong, Ssukriti, aluu317, fabianlim and kmehant as code owners December 11, 2024 18:21

willmj previously approved these changes Dec 11, 2024

test: multiple datasets with multiple datafiles column names

e89002d

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

willmj dismissed their stale review via e89002d December 11, 2024 20:48

PR changes

4ba1c04

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Abhishek-TAMU commented Dec 11, 2024

Abhishek-TAMU mentioned this pull request Dec 12, 2024

test: Add support and unit tests for handling of multiple files passed as a pattern in data_config #416

Closed

2 tasks

dushyantbehl reviewed Dec 12, 2024

Abhishek-TAMU added 3 commits December 12, 2024 18:37

PR Changes

3fce172

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

fix: fmt

68a0f50

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Merge test_process_dataconfig_multiple_files_varied_data_formats

5905e23

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

Merge remote-tracking branch 'upstream/main' into multiple_files

6f13d9a

Abhishek-TAMU requested review from dushyantbehl and willmj December 13, 2024 04:03

willmj mentioned this pull request Dec 13, 2024

feat: data folder processing in datapreprocessor #417

Closed

2 tasks

dushyantbehl approved these changes Dec 13, 2024

Merge remote-tracking branch 'upstream/main' into multiple_files

83d0127

Abhishek-TAMU requested a review from dushyantbehl December 13, 2024 16:26

willmj approved these changes Dec 13, 2024

Abhishek-TAMU merged commit 4441948 into foundation-model-stack:main Dec 13, 2024
8 checks passed

Abhishek-TAMU deleted the multiple_files branch December 13, 2024 16:41
